### Configuration specs for running Anomaly Detection as a Watson Studio Job

This file describes the configuration settings that drive the Anomaly Detection notebook in autopilot mode. To run `Anomaly Detection (Config Based).ipynb` with external settings, use the file `ad_config_standard.json` in the `data` folder as the basis and modify the parameters to suit your dataset. The description of each parameter and its allowed values are provided below.

|Parameter|      Description      |Allowed Values|System Default|
|---------|----------------------------|--------------|--------------|
data_source|Indicates if the data for training is fetched from the MAS monitor data lake or CSV files |mas_monitor_data_lake or CSV |mas_monitor_data_lake    
asset_group_label|The label for the asset group in Maximo|User's choice. It should match the value in Maximo; for example: IIOT_DEVICE_AD_EXPLNR|None, but this is mandatory
device_type|Device type used at the Monitor level|User's choice; for example MAS_IIOT_DEVICE_C1|None, but mandatory        
sensor_data_file|CSV file containing the sensor measurements, asset Id and timestamps. Should be available in Watson Studio project's data assets|User's choice. For example: sensor_train_unsupervised.csv |None, but mandatory if the parameter `data_source` is CSV        
validation_data_file|CSV file with the labeled data. Should be available in Watson Studio project's data assets|User's choice. For example: sensor_train_labeled.csv|None, but mandatory if `data_source` is CSV and the parameter `learning_mode` is set to semi-supervised
asset_id_column_name|The column of the data frame that contains the Id of the asset|User's choice|id    
timestamp_column_name|The column of the data frame that contains the timestamp when the measurements in that particular row were recorded|User's choice|evt_timestamp      
source_variables|A **list** of column names containing the raw variables excluding the asset Id and timestamp columns. This should be provided as a list of values|Depends on the user's data|None, but mandatory. If not provided, the system will try to infer from the data frame          
data_resampling|If the training data needs to be resampled, provide the sampling window in time units|Any value using a pandas time series alias|None. This is not a mandatory parameter. If it is not provided, no resampling is done
learning_mode|Choice of unsupervised vs semi-supervised approach. If semi-supervised approach is chosen, labelled data should be provided|unsupervised or semi-supervised|unsupervised    
enable_temporal_features|Flag indicating if temporal / statistical features are needed|"True" or "False" as string|False. By default no features will be created    
rolling_window_size|Rolling window size for temporal features. Not to be confused with `data_resampling` above|Any value using a pandas time series alias|None. This is not a mandatory parameter. However, if `enable_temporal_features` is set to "True", a suitable value for this parameter must be provided
minimum_periods|Value for minimum periods. Refer to the pandas documentation|Pandas aliases|None. This is not a mandatory parameter
simple_aggregation_functions|Simple temporal statistics|Provide this input as a list containing one or more labels from this string `['min', 'max', 'mean', 'std', 'sum', 'count','median']`|None. This is not a mandatory parameter. But if `enable_temporal_features` is set to "True", then at least one of the aggregation functions must be provided: either this one, `higher_order_aggregation_functions`, or `advanced_aggregation_functions`. All three sets of aggregation functions can also be used together
higher_order_aggregation_functions|Higher order temporal statistical functions|Provide this input as a list containing one or more labels from this string `['sum','skew','kurt','quantile_25','quantile_75','quantile_range']`|None. This is not a mandatory parameter. But if `enable_temporal_features` is set to "True", then at least one of the aggregation functions must be provided: either this one, `simple_aggregation_functions`, or `advanced_aggregation_functions`. All three sets of aggregation functions can also be used together
advanced_aggregation_functions|Advanced temporal statistical functions|Provide this input as a list containing one or more labels from this string `['rate_of_change', 'sum_of_change', 'absolute_sum_of_changes', 'trend_slop','abs_energy', 'mean_abs_change', 'mean_change', 'mean_second_derivate_central', 'count_above_mean', 'count_below_mean']`|None. This is not a mandatory parameter. But if `enable_temporal_features` is set to "True", then at least one of the aggregation functions must be provided: either this one, `simple_aggregation_functions`, or `higher_order_aggregation_functions`. All three sets of aggregation functions can also be used together
wml_deployment|Flag indicating WML deployment is needed|"True" or "False"|"True"    
wml_deployment_space_name|The name of the WML deployment space for the model deployment|User's choice|None. This is not mandatory unless the `wml_deployment` parameter is set to "True"
wml_model_type|The model type as per the WML alias|Any of the WML supported aliases|scikit-learn_1.3    
wml_base_software_spec_name|Software specification name as per the WML alias|Any of the WML supported aliases|runtime-24.1-py3.11  
wml_model_name|Name for the model to be deployed in WML. Not used in this version of the notebook|User's choice. Suggest a short name within 5 - to characters|System will use a standard name
wml_model_description|Description of the model deployed on WML|User's choice|The system will create a default description      
data_quality_analysis|Define this spec using the example to perform the data quality analysis on the input data|See the example and documentation below. This is defined as a dictionary | None     
missing_value_analysis|This is step 1 of the data quality analysis|See the example in the provided JSON file|None
missing_value_thresholds|The upper limit on the percentage of missing values allowed for a given column|Using JSON syntax, define the missing value threshold for each variable of interest|None
stop_if_missing_values_exceed_threshold|If this is set to "True" the notebook will stop executing when it encounters at least one column or variable that has missing values exceeding the user specified threshold|"True" or "False" (strings - not Python boolean types)|"False"       
auto_imputation_config|If values are missing, this config provides the parameters for auto imputation|See the example file|None
use_mcar|If this is set to "True", only variables whose values appear to be missing completely at random (MCAR) will be imputed|"True" or "False" (strings - not Python boolean types)|"True"
execution_params|Specification for the ML pipelines|Follow the example provided in the JSON file|System defaults. This entire block is optional
execution_type|Hyperparameter search strategy|One of the values, as a string, from this list [`single_node_complete_search`, `single_node_random_search`, `spark_node_random_search`,`spark_node_complete_search`,`evolutionary_search`,`rbfopt_search`,`hyperband_search`,`bayesian_search`]|`spark_node_random_search`
number_of_option_per_pipeline|Number of parameter settings that are sampled. This parameter is applicable for `spark_node_random_search` and `single_node_random_search` execution types|An integer depending on the parameter space|10     
maximum_evaluation_time_per_pipeline|Maximum timeout in minutes for execution of pipelines with unique parameter grid combination. This parameter is applicable for `spark_node_random_search` and `spark_node_complete_search` execution types   |Any integer depending on the expected running time for the given number of pipelines and the dataset|2     
total_execution_time|Total execution time in minutes for all the pipelines|Any integer depending on the expected running time for the given number of pipelines and the dataset|10   
random_state|Random state to initialize and get consistent results|Any integer|42    
log_level|Level of logging|low or medium or high|low     
scaling|The scaling operation to be done to the dataset|normalize or standardize or robust|No scaling will be applied. This value is case insensitive   
normalization_range|If normalization needs to be applied to the data, this specifies the range|Any integer range as a list of two values - lower and upper limit|[0,1]     
scoring_method|Choice of core computation algorithm| One of the strings in this list [`em_score`, `mv_score`,`al_score`]|`em_score`     
threshold_criterion|Choice of computing anomaly threshold|One of the values from the list. See example JSON file [`contamination`,`qfunction`,`std`,`adaptivecontamination`, `medianabsolutedev`,`otsu`]|{'std':2.0}        
estimator_selection_criteria|The criterion for the notebook to pick the right estimator from the list|One of the values from the list [`maximum_anomalies`,`minimum_anomalies`]|`minimum_anomalies`        
include_extended_algorithms|Choice of using the full stack with all combinations of pipelines. Note that setting this to "True" will consume a lot of resources, and take a lot of time for the notebook to finish|Either "True" or "False" as string|"False"     
include_covariance_based_techniques|Choice of including or excluding covariance based estimators of sklearn. Note that setting this to "True" will consume increasing levels of resources and take a lot of time for the notebook to finish. The time will increase depending on the number of features|Either "True" or "False" as string|"False"
use_specific_estimators|If the user wants to use a specific estimator or estimators, specify them here. Only these estimators will be tried with the dataset after scaling, if the scaling parameter is configured|One or more values from the list [`isolationforest`,`nearestneighboranomalymodel`,`mincovdet`,`anomalyensembler`,`nsa`,`predictonly_anomalyensembler`,`anomalyrobustpca`,`extendedisolationforest`,`lofnearestneighboranomalymodel`,`neuralnetworknsa`,`anomalypca_t2`,`anomalypca_q`,`samplesvdd`,`empiricalcovariance`,`ellipticenvelope`,`ledoitwolf`,`oas`,`shrunkcovariance`,`oneclasssvm`,`gaussiangraphicalmodel`,`gmmoutlier`,`cusum`,`kerneldensity`,`graphpgscps`,`hotellingt2`,`spad`,`extendedspad`,`oob`,`DNNAutoEncoder`,`ggm_snn`,`ggm_kl_div_dist`,`ggm_kl_divergence`,`ggm_frobenius_norm`,`ggm_likelihood`,`ggm_spectral`,`ggm_mahalanobis`,`ggm_sparse_subgraph`,`graphquic`,`randompartitionforest`]|A system default involving the estimators in this list will be used - [isolationforest, nearestneighboranomalymodel, lofnearestneighboranomalymodel, anomalyensembler, anomalyrobustpca, anomalypca_t2, anomalypca_q, oneclasssvm, gmmoutlier]
exclude_specific_estimators|If the user wants to eliminate one or more estimators from consideration, they can be specified here as a list|The same list provided above|Nothing is excluded by default        
monitor_deployment|If deployment in Monitor is needed|Provide the spec as shown in the JSON example. If this spec is provided, the system will assume Monitor deployment is desired|None
model_instance_name|Any string as a unique name for the deployment|Any user specified string|System default    
model_instance_desc|Any string as a description for the deployment|Any user specified string|System default
write_initial_result|Choice of whether or not to write the results of training to the database|Either "True" or "False" as string|"True"      
model_upgrade|Set this value to "True" if the model is to be updated upon retraining without losing the past state and predictions|Either "True" or "False" as string|"False"
enable_model|If the model needs to be enabled for scoring|Either "True" or "False" as string|"True"
scoring_schedule|The scoring schedule for the ongoing scoring job|A dictionary as shown in the JSON example. `{'starting_at':Either a UTC time or the string "in_5_minutes", 'every': Pandas time alias. Default is '1D'}` If "in_5_minutes" is provided, the model will be enabled to score starting in the next 5 minutes|`{'starting_at':"in_5_minutes", 'every':'1D'}`

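Several parameters in the table are only conditionally mandatory, so it can help to check a config before submitting the job. The sketch below is not part of the notebook: `check_ad_config` is a hypothetical helper, and it assumes the checked keys sit at the top level of the JSON object (the actual layout of `ad_config_standard.json` may nest some of them). It covers only the mandatory and conditional rules stated in the table.

```python
# Defaults copied from the "System Default" column of the table above.
DEFAULTS = {
    "data_source": "mas_monitor_data_lake",
    "learning_mode": "unsupervised",
    "enable_temporal_features": "False",
    "wml_deployment": "True",
}

AGGREGATION_KEYS = (
    "simple_aggregation_functions",
    "higher_order_aggregation_functions",
    "advanced_aggregation_functions",
)


def check_ad_config(config):
    """Return a list of problems found in `config` (hypothetical helper).

    Assumes a flat dictionary; this layout is an assumption, not the
    notebook's documented schema.
    """
    cfg = {**DEFAULTS, **config}
    problems = []

    # Parameters listed as mandatory with no system default.
    for key in ("asset_group_label", "device_type"):
        if not cfg.get(key):
            problems.append(f"{key} is mandatory")

    # Flags must be the strings "True"/"False", not Python booleans.
    for key in ("enable_temporal_features", "wml_deployment"):
        if cfg[key] not in ("True", "False"):
            problems.append(f'{key} must be the string "True" or "False"')

    # CSV input needs a sensor file, plus labeled data for semi-supervised mode.
    if cfg["data_source"] == "CSV":
        if not cfg.get("sensor_data_file"):
            problems.append("sensor_data_file is mandatory when data_source is CSV")
        if cfg["learning_mode"] == "semi-supervised" and not cfg.get("validation_data_file"):
            problems.append(
                "validation_data_file is mandatory for semi-supervised learning on CSV data"
            )

    # Temporal features need a window and at least one aggregation list.
    if cfg["enable_temporal_features"] == "True":
        if not cfg.get("rolling_window_size"):
            problems.append("rolling_window_size is required for temporal features")
        if not any(cfg.get(key) for key in AGGREGATION_KEYS):
            problems.append("at least one aggregation function list is required")

    # WML deployment (on by default) needs a deployment space name.
    if cfg["wml_deployment"] == "True" and not cfg.get("wml_deployment_space_name"):
        problems.append("wml_deployment_space_name is required for WML deployment")

    return problems


# Example values are taken from the table; "1D" is the pandas alias it mentions.
example = {
    "data_source": "CSV",
    "asset_group_label": "IIOT_DEVICE_AD_EXPLNR",
    "device_type": "MAS_IIOT_DEVICE_C1",
    "sensor_data_file": "sensor_train_unsupervised.csv",
    "enable_temporal_features": "True",
    "rolling_window_size": "1D",
    "simple_aggregation_functions": ["mean", "std"],
    "wml_deployment": "False",
}
print(check_ad_config(example))  # prints [] because every rule is satisfied
```

To check a real file, load it with `json.load` and pass the resulting dictionary to the helper; an empty list means none of the rules above were violated.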